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Abstract 

Memory latency is becoming an increasingly important performance bottleneck, especially in multiprocessors. 
One technique for tolerating memory latency is multithreading , whereby we switch between threads upon 
expensive cache misses. In contrast with previous work on multithreading, we explore a new approach that 
is software-controlled rather than hardware-controlled. To implement software-controlled multithreading, 
we use informing memory operations to quickly trap upon cache misses to a miss handler which performs 
the actual thread switching in software. Our experimental results demonstrate that software-controlled 
multithreading can result in significant performance gains on a shared-memory multiprocessor, with the 
majority of applications speeding up by 10% or more, and one application speeding up by 16%. In addition, 
we find that by selectively applying a register partitioning optimization to reduce the thread-switching 
overhead, we can increase the overall speedups to as much as 25%. Given the much simpler hardware 
support required by our scheme, and the fact that its software overheads are expected to become less 
and less expensive over time relative to memory latencies, software-controlled multithreading is attractive 
alternative to traditional hardware-based schemes. 
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1 Introduction 


Memory latency is a key performance bottleneck in modern microprocessor-based systems. As we look to the 
future, the relative importance of memory latency is expected to increase as the gap between processor and 
memory speeds continues to grow, and as wider-issue processors increase the effective performance penalty 
of each cycle of latency. While memory latency presents a challenge for all systems, the problem is especially 
acute in large-scale shared-memory multiprocessors, where accesses to remote memory locations can suffer 
latencies on the order of hundreds of cycles [9]. Although cache hierarchies are an essential first step toward 
coping with this problem, they are not a complete solution. To further tolerate latency, one attractive 
technique is to use a form of multithreading [1, 15, 18] whereby a long-latency access from one thread is 
overlapped with the computation from other parallel threads. (Note that throughout the remainder of this 
paper, we will use the term “multithreading” to refer to multithreading for the sake of latency tolerance, as 
opposed to more general forms of multithreading.) 

1.1 Previous Work on Multithreading 

Several researchers have proposed and evaluated hardware-based multithreading schemes in the past [1,2, 
8, 15, 18]. These schemes can be broken down into roughly three categories: fine-grained , coarse-grained , 
and simultaneous multithreading. 

The idea behind fine-grained multithreading —as exemplified by the HEP [15] architecture—is to uncon¬ 
ditionally switch between threads at a very fine granularity: i.e. once every cycle. The advantage of the 
fine-grained approach is that since the hardware knows ahead of time that a thread switch will occur on 
every cycle, the pipeline can be designed such that there is minimal switching overhead. The disadvantage 
of this approach, however, is that it relies on having a large number of parallel threads to keep the pipeline 
full. For applications with only limited amounts of thread-level parallelism (e.g., when there is only a single 
thread), the performance tends to suffer relative to a conventional, non-multithreaded processor, since each 
thread can utilize only a small fraction of the processing resources. 1 

Rather than switching between threads on every cycle, the idea behind coarse-grained multithreading is to 
allow a given thread to continue running (with the full processor to itself) until it encounters a long-latency 
operation; only at that point does the processor switch to executing another thread. An example of this 
coarse-grained approach is the MIT APRIL architecture [1], In contrast with the fine-grained approach, 
coarse-grained multithreading offers better single-thread performance and requires a smaller number of par¬ 
allel threads to hide latency. The disadvantage of this approach, however, is that since cache misses are 
detected relatively late in the pipeline, the minimum thread switching time is non-trivially large. Hence this 
scheme is not appropriate for hiding short latencies (e.g., primary cache misses which are satisfied by the 
secondary cache), and it is primarily used to hide the large latencies found in shared-memory multiprocessors. 

Finally, a more recent proposal known as simultaneous multithreading [18] leverages the register re¬ 
naming mechanism within dynamically-scheduled superscalar processors to allow instructions from multiple 
threads to be active simultaneously within the pipeline. The advantage of the simultaneous multithreading 
approach—and of an earlier technique called interleaving [8]—is that it enjoys good single-thread performance 
without paying a significant thread switching penalty. 

A common feature of all of these multithreading techniques is that the decision of when to switch between 
threads and the actual switching itself is controlled entirely by hardware. As a result, a non-trivial amount of 
hardware support is required to manage the multiple threads. For example, to minimize the thread switching 
latency, coarse-grained multithreaded processors typically replicate key per-thread state such as the register 
file [1]. Under simultaneous multithreading, the concept of “thread switching” is effectively eliminated at 
the point where instructions reach the functional units—i.e. when they are buffered in dynamic instruction 
scheduling queues—since register renaming has already isolated the effects of independent threads. However, 
simultaneous multithreading does require some non-trivial hardware support to fetch, issue, and retire in¬ 
structions from multiple threads properly. More importantly, simultaneous multithreading requires a larger 
register file to accommodate the multiple threads, and this is likely to increase register access latencies and 
possibly add additional stages to the pipeline [17]. Concern over the potential impact of multithreading 
hardware support on single-thread performance may be a contributing factor to why we have yet to see 
hardware-based multithreading in commodity microprocessors. 

^his problem is exacerbated by the fact these types of machines often do not have data caches or pipeline interlocks. 
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Rather than relying on specialized hardware support, an alternative approach is to use software to 
implement multithreading. The advantage of this approach is that there is obviously no degradation in 
single-thread performance (since the processor is not modified); the disadvantage, however, is that the thread 
switching time is significantly larger than when it is accelerated by special hardware support, and this may 
limit the types of latency that can be successfully hidden. Previous studies have considered purely software- 
based multithreading in the context of hiding remote latencies in software distributed shared memory (DSM) 
machines [12, 16]. Purely software-based multithreading makes sense for software DSMs for two reasons: 
(i) software is already invoked upon the start of a remote access, and therefore it knows when to initiate a 
thread switch; and (ii) remote access latencies are so large in software DSMs [3] (typically several orders of 
magnitude larger than in hardware DSMs [9]) that the overhead of switching threads in software is small by 
comparison. As a result, both the Mowry et al. [12] and Thitikamol and Keleher [16] studies found positive 
results when using software-based multithreading to hide the large remote latencies in software DSMs. 

An open research question is whether software-based multithreading can successfully tolerate more modest 
forms of latency, such as the remote latencies in hardware DSMs (e.g., the SGI Origin [9]). To implement 
software-based multithreading, we need two software mechanisms: (i) the ability to switch between threads; 
and (ii) a mechanism for knowing when to trigger thread switches. The former mechanism is clearly feasible, 
since software can save and restore all thread-specific state (e.g, registers, the program counter, any condition 
codes, etc.). The latter mechanism, however, had been lacking in the past, since there was no way for software 
to directly observe and react to cache misses in a sufficiently lightweight fashion. (Note that the signal handler 
mechanism used to trigger thread switches in software DSMs is not applicable to cache misses, since it is 
too costly and can only react to page-level access violations.) Fortunately, a mechanism which provides this 
functionality was recently proposed by Horowitz et al. [5, 6]: informing memory operations. 

1.2 Informing Memory Operations 

The idea behind informing memory operations [5, 6] is to make cache misses directly observable to software, 
and to enable software to quickly react to these misses. In essence, an informing memory operation consists 
of a memory operation that is combined—either implicitly or explicitly—with a conditional branch-and-link 
operation where the branch is taken only if the reference suffers a cache miss. Horowitz et al. [5, 6] describe 
two possible implementations of informing memory operations: one based on branching on a cache-outcome 
condition code, and another based on a low-overhead trap. 

The low-overhead trap approach works as follows. Two new user-visible registers are added to the 
architecture: (i) a Miss Handler Address Register (MHAR), which contains the address of the miss handler 
to be invoked upon a cache miss (setting this register to zero disables the trapping mechanism); and (ii) a 
Miss Handler Return Register (MHRR), which contains the return address for resuming execution at the end 
of the trap (i.e. it contains the address of the instruction following the memory reference that missed). Upon 
a cache miss, if the MHAR contains a non-zero value, then a branch-and-link occurs to this address, and the 
MHRR is set appropriately. Unlike traditional trapping mechanisms, this one is extremely lightweight since 
it occurs entirely at the user level (no operating system code is executed), and the only state that is saved is 
the MHRR. In other words, the run-time overhead is comparable to a traditional branch-and-link instruction, 
rather than a traditional trap. The authors demonstrate how this mechanism can be implemented within 
modern in-order and out-of-order superscalar pipelines without much additional complexity, since the bulk 
of the necessary hardware support already exists for handling branches and exceptions. The advantage of the 
low-overhead trap approach is that it potentially incurs no overhead on cache hits (unlike the cache-outcome 
condition code approach, which requires an explicit branch to test the condition code even on cache hits). 
Hence we will focus on the low-overhead trap approach throughout the remainder of this paper. 

There are a number of applications of informing memory operations. For example, since they can be used 
to collect memory performance information accurately and with little Overhead, informing memory operations 
enable a wide range of new performance monitoring tools which can guide either the programmer or the 
compiler in identifying and eliminating memory performance problems. In addition, Horowitz et al. [5, 6] 
also demonstrated that informing memory operations can automatically enhance the performance gains from 
software-controlled prefetching [10, 11, 13], and that they can accelerate software-based cache coherence with 
fine-grained access control [14]. The authors also suggest that informing memory operations could be used 
to implement software-controlled multithreading, but there has been no detailed study of this approach until 
now. 
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1.3 Objectives of This Study 

In this paper, we perform a detailed evaluation of whether software-controlled multithreading based on 
informing memory operations can successfully improve the performance of parallel applications running 
on shared-memory multiprocessors with hardware cache coherence. In addition to evaluating our baseline 
scheme, we also investigate a number of extensions which are designed to further enhance the performance 
of software-controlled multithreading. 

We focus on hardware DSMs rather than uniprocessors for two reasons. First, since applications written 
for hardware DSMs already contain parallel threads, it is straightforward to extract the additional parallel 
threads necessary for multithreading. (In contrast, the bulk of applications run on uniprocessors contain only 
a single thread, and parallelizing them is a non-trivial effort.) Second, hardware DSMs tend to suffer more 
from memory latency than uniprocessors—due to the large latency of remote accesses and the additional cache 
misses due to communication patterns—and therefore they are an important target for latency tolerance. 
If software-controlled multithreading on hardware DSMs is successful, then we get the best of both worlds: 
the benefits of multithreading when it pays off, and maximum single-thread performance when it does not. 

The remainder of the paper is organized as follows. We begin in Section 2 by examining the issues involved 
in implementing software-controlled multithreading. Section 3 discusses our experimental methodology, and 
Section 4 presents our experimental results. Finally, we conclude in Section 5. 

2 Software-Controlled Multithreading 

In this section, we discuss the major challenges and tradeoffs involved with implementing software-controlled 
multithreading. We begin by discussing the hardware support necessary for this scheme, and then present a 
design of the miss handler software which performs the actual thread switching. Finally, we discuss how our 
scheme avoids deadlock and handles synchronization events properly. 

2.1 Hardware Support 

The target architecture for our study is a hardware cache-coherent shared-memory multiprocessor com¬ 
prised of out-of-order superscalar processors. For the sake of concreteness, we will use the MIPS R10000 
processor [21] as the basis for our discussion, although similar issues apply to other out-of-order superscalar 
processors. 

Our goal is to support software-controlled multithreading with minimal hardware support beyond in¬ 
forming memory operations. There are three issues, however, which may require some additional hardware: 
the first two involve potential problems that would prevent us from overlapping enough computation with 
the cache miss, and the third involves our ability to selectively switch threads only upon long-latency misses. 

The first obstacle to consider is that when a load suffers a cache miss, it typically cannot retire from 
the reorder buffer until its cache miss has completed. Since all instructions must retire in-order (even in 
an out-of-order issue machine), this means that all instructions executed after the miss (including thread 
switching code and the thread that we switch to) must remain in the reorder buffer until the miss completes. 
The problem is that reorder buffers are typically small (e.g., 32 entries in the R10000) relative to the number 
of instructions that one would need to execute to fully hide a remote cache miss (e.g., several hundred 
instructions in the SGI Origin). Hence the reorder buffer will fill up quickly upon a thread switch, causing 
the processor to stall before it can hide the miss latency. For example, the R10000 does not have sufficient 
buffering to even execute our thread switching code (described later in Section 2.2), let alone the thread 
that we are attempting to activate. To address this problem, we need a mechanism for specifying that the 
load should be allowed to retire, despite the fact that its miss is still in progress. In essence, we would like 
to convert the load into a prefetch , since prefetches can retire before their misses complete. Converting the 
load to a prefetch is acceptable because we do not care about the result of the load—only that it brings the 
line into the cache—since we will resume execution by re-executing the load that missed (as discussed later 
in Section 2.2). While there are a number of ways to accomplish this, one possibility is to set a flag which 
indicates to the trapping mechanism that upon a cache miss, the offending load should be allowed to retire 
(similar to a prefetch). Such an option may be useful in other cases where the miss handler would like to 
execute a non-trivial amount of code underneath the cache miss, and where the miss handler will resume 
execution by re-executing the load which invoked the trap, rather than the instruction which follows it. 
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The second potential problem is that during a thread switch, any use of the load destination register 
(e.g., if we attempt to save it to memory as part of saving the thread state) will result in a data dependence 
that will stall the processor until the load completes. Since we do not care about the result of the load (it 
will be re-executed later), there is no need to save this register value. One software-based solution would 
be to save all registers except the load destination; therefore when the register state of the thread we are 
switching to is restored, the act of overwriting this register will break the original data dependence on the 
load (due to register renaming), thus avoiding a stall. While this approach will work, the problem is how 
to quickly determine which register is the load target (since this information is not readily available inside 
the miss handler) and avoid saving it. One possibility would be to look up this value in a hash table based 
on the return address in the MHRR; however, this will result in non-trivial software overhead. Another 
possibility would be for the hardware to make the destination register number directly visible to the miss 
handler software, perhaps through another special architected register. While this would eliminate the need 
for a hash table lookup, we would still need to branch to a specialized version of the thread switching code 
to avoid saving the given register. The most desirable solution would be for the hardware to automatically 
break the data dependence on the load result when it is marking the load as being able to retire despite its 
outstanding miss (as discussed earlier). In other words, we would like to fully convert the load to having 
the same functionality as a prefetch: i.e. it can graduate immediately, and it produces no register result. 
Breaking this register dependence is realistic for the hardware because the Miss Status Handling Register 
(MSHR) [7]—the structure which tracks an outstanding miss in a lockup-free cache [4]—already maintains 
this register number. In our experiments, we assume that this latter hardware support is available. 

The third area where additional hardware support may be helpful is in identifying (or predicting) whether 
a given cache miss is likely to suffer a large latency. Since multithreading can only improve performance 
if the miss latency is larger than the latency of switching between threads—and since our software-based 
approach requires roughly 55 cycles to switch threads—we cannot hide the latency of primary cache misses 
which hit in the secondary cache. Hence we only want to switch threads upon secondary cache misses (which 
are still large relative to our thread switching time). Ideally, we would like an informing mechanism where 
traps only occur upon secondary misses—however, implementing this may be difficult (or even impossible) 
given how late the secondary cache tags are checked. Instead, we assume that traps can only occur upon 
primary cache misses, but that inside the miss handler we can test a flag which indicates whether the primary 
miss is also a secondary cache miss. 2 This is similar to the condition-code approach that was discussed by 
Horowitz et ah [5, 6]. 

Note that in all three of these cases, the additional hardware support only affects actions taken upon 
miss handler invocation, and there is flexibility in how quickly the actions are performed. Hence we would 
not expect any of these features to slow down the critical path of normal execution. Having described our 
hardware support, we now discuss how it can be used to implement the miss handler. 

2.2 Design of the Miss Handler 

We use a single miss handler to implement multithreading, as shown in Figure 1. The MHAR is set to 
contain this handler address at the start of execution, and is restored after each trap so that we continue 
using this same handler. As we see in Figure 1, the miss handler begins by subtracting four bytes (i.e. one 
instruction word) from the MHRR so that it will eventually restart the thread at the memory reference 
that missed, rather than at the instruction after it. The reason for doing this is that the original reference 
has been converted into a prefetch by the hardware (as discussed in the previous section), and therefore 
the reference must be re-executed to complete properly. The handler then tests whether the primary miss 
was also a secondary cache miss. If so, then the handler switches to a new thread; otherwise, it returns 
immediately. 3 

To switch between threads, the miss handler first saves the state of the current thread to memory, it 
then selects a thread to restart using a simple round-robin scheme, and finally it restores the state of this 
new thread. To prevent the memory references inside the miss handler from triggering additional informing 
memory traps, the trapping mechanism is disabled during the thread switch by writing a zero into the 
MHAR. Since user code in MIPS-based systems does not use the kO register, we use it as a pointer to where 

2 Note that the processor will interlock on this flag until it is available. 

3 Note that the processor will stall until the secondary cache miss flag is valid. If this is likely to take a non-trivial amount 
of time, then some of the thread switching code can be scheduled before this test to avoid wasting time. 
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HandlerAddress: 

add MHRR, MHRR, -4 // 

bne #0, CMF, L2Miss // 

j HHRR // 

L2Hiss: 

li MHAR, #0 // 

li kO, tMembase // 

lw kO, O(kO) 

sw rl, O(kO) // 

sw r2, 4(k0) // 


Point the MHRR to previous inst 
Continue if cache-miss flag is set 
else L2 hit, so just return 

Disable miss-handler 
Get ptr to current state 

Save integer registers 
excluding kO,kl,rO 


sw r31, 112(k0) 

sw fcr31, 116(k0) // Save fp condition code register 

sd fO, 120(kO) // Save fp registers 

sd f2, 128(kO) 


sd f30, 240(kO) 
sw MHRR, 248CkO) 
addu kO, kO, 256 
and kO, kO, tFFFFOFFF 

sw kO, Membase 
lw rl, 0(k0) 
lw r2, 4(k0) 


// Save MHRR 

// Find & save ptr to new context state 
// Assume 16 contexts,256 bytes/context 
// and round robin selection method 

// Restore integer registers 


lw r31, 112(k0) 

lw fcr31, 116(k0) // Restore fp condition code register 

Id fO, 120(kO) // Restore fp registers 

Id f2, 128(kO) 


Id f30, 240(kO) 

Id MHRR, 248(kO) // Restore MHRR 

li MHAR, fHandlerAddress // Re-enable miss-handler 

j MHRR // Jump to new context 


Figure 1: MIPS pseudo-code representation of the miss handler for software-controlled multithreading. 


the thread state is stored. Assuming that the number of active threads per processor is a power of two, 
our simple round-robin scheme requires only three instructions to determine the next thread to be executed. 
Finally, the handler resumes thread execution by jumping to the address in the MHRR. 

As we observe from this code, there are two major dimensions to consider when performing multithreading 
in software: (i) how to manage the saving and restoring of thread state; and (ii) how to decide when it is 
desirable to switch threads. We now consider both of these issues in greater detail. 

2.2.1 Saving and Restoring Thread State 

Our multithreading scheme is similar to coarse-grained hardware-based schemes (e.g., APRIL [1]) in that 
thread switches are triggered by cache misses. An important difference, however, is that these hardware- 
based schemes devote special hardware to quickly saving and restoring the register state of threads. In 
contrast, we must save and restore registers through explicit loads and stores to memory. This overhead 
accounts for the bulk of our thread switching latency (which is roughly 55 cycles). The good news is that the 
thread state tends to stay in the primary data cache, which prevents the latencies from being even larger. 
However, since these non-trivial thread switching times are a potential performance bottleneck, we would 
like to reduce them even further. 

The major trick for reducing the thread switching overhead is to avoid saving and restoring registers 
that do not need to be preserved. As a simple example, some applications do not use floating-point registers 
at all; by recognizing this fact, we could eliminate roughly half of the thread switching overhead in such 
applications. In general, the compiler can determine which registers are live at any given point in the 
program, and it could use this information to select a miss handler that has been customized to only save 
these live registers. While this approach may sound good in theory, it suffers the following limitations in 
practice. First, customizing the miss handler on a reference-by-reference basis involves either setting the 
MHAR before each reference, or else using the MHRR inside the miss handler to hash into a jump table. 
The Horowitz et ah study [5] quantified these types of overheads, which appear to be large enough to 
offset a non-trivial portion of the expected gains. A related limitation is that creating a large number of 
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customized miss handlers will degrade the instruction cache performance. Finally, while it is easy to specify 
which registers are to be saved by choosing the right customized miss handler, it is more difficult to recognize 
which registers are to be restored , since this requires that we recognize the context of the suspended thread. 4 

A simpler approach to reducing the overhead of saving and restoring registers is to statically partition 
the registers between threads. For example, if we wanted to run two threads per processor, the compiler 
could compile each thread to use only half of the user registers. (Note that special-purpose registers—e.g., 
the stack pointer—cannot be partitioned.) The advantage of this approach is that many of the registers 
would be preserved in the register file itself, thus avoiding the need to save them to memory. The main 
disadvantage, however, is that each thread may suffer reduced performance due to having fewer available 
registers. (Another disadvantage is that code replication may impact the instruction cache performance.) 
Rather than taking an all-or-nothing approach, there is in fact a continuum of possibilities between saving 
all registers and partitioning all user registers. For example, it may be beneficial to give each thread one 
additional register at the expense of slightly increased switching overhead. We will evaluate the benefits of 
this static partitioning approach later in Section 4. 

2.2.2 Deciding When to Switch Threads 

The second major challenge for software-controlled multithreading is switching threads only when the miss 
latency is expected to be large relative to the thread switching overhead. For our purposes, this means 
switching only upon secondary cache misses. Unfortunately—as we mentioned earlier—it is not likely that 
the result of the secondary cache tag check will be available early enough to trigger a trap. Instead, the 
strategy which we outlined in Figure 1 is to test whether the primary miss (which triggered the trap) is also 
a secondary cache miss once we are inside the miss handler. The main disadvantage of this approach is that 
if the reference does hit in the secondary cache, then we have wasted overhead with no benefit. 

To avoid this useless overhead, we would like to predict a priori whether a given reference is likely to result 
in an expensive cache miss. If we believe that it will not, then we can disable the trapping mechanism for 
that reference. One possibility would be for the compiler to statically analyze the data locality [11, 19]; this 
technique has mainly been successful at predicting cache misses in matrix-based codes. Another possibility 
would be to collect a profile of how frequently each memory reference suffers a long-latency miss, and to feed 
this information back into the compiler. Finally, another possibility would be to use hardware to predict the 
conditional probability of a reference suffering a long-latency miss, given that it has suffered a primary cache 
miss. Such a prediction mechanism could use techniques similar to those used for branch prediction. With 
this information, the user could specify that they would like informing traps to occur only upon primary 
cache misses which are also predicted to be expensive misses. Implementing this behavior would be feasible 
since both the primary cache miss signal and the “expensive miss” prediction value would be available early 
enough to control the trap mechanism. 

Of course, the drawback of using a prediction mechanism is that if it incorrectly predicts that a miss will 
be inexpensive when it turns to be expensive, then it is too late to invoke the thread switching code to hide 
the miss latency. We will evaluate the potential benefit of such techniques later in Section 4. 

2.3 Avoiding Deadlock and Handling Synchronization Properly 

By interleaving multiple threads on the same physical processor, multithreading introduces the possibility of 
deadlock in two ways. First, a repeated pattern could occur where thread A steals resource X from thread B 
(which is currently suspended, also waiting for resource A"), only to suffer a thread switch back to B before 
A can use X; when thread B restarts, it steals resource X back from thread A, but also switches back to A 
before B can use X, etc. Such a pattern could be repeated infinitely as the two threads rapidly switch back 
and forth but neither thread makes progress. This scenario can arise when multiple threads suffer cache 
misses for unique addresses which map into the same cache entry. To prevent this problem, we swap out 
a given thread only once when it encounters a cache miss. If the miss has not completed by the time the 
round-robin scheduler reactivates the thread, then the thread stalls at that point until the miss completes 
(rather than switching to another thread). 5 Hence forward progress is guaranteed. 

4 One way to implement this would be to save the instruction address of the customized code that should be used to restore 
a thread along with its other register state, and to jump to this address in the process of switching threads. 

5 Although it is not clear how this works from our pseudo-code in Figure 1, the idea is to either postpone turning the 
miss handler back on until after the original reference completes when the thread is restarted (this can be accomplished if the 
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Table 1: Simulation parameters 


Memory Parameters j 

Line Size 

32B 

Instruction Cache 

32KB, 2-way set-assoc 

Data Cache 

32KB, 2-way set-assoc 

Unified Secondary Cache 

2MB, 2-way set-assoc 

Data Cache Banks 

2 

Data Cache Fill Time 
(Requires Exclusive Access) 

4 cycles 

Miss Handlers (MSHRs) — "" 

16 for data, 2 for insts 

Main Memory Bandwidth 

1 access per 20 cycles 

Total Miss Latency to 
Secondary Cache 

14 cycles 

Total Miss Latency to 

Local Memory 

78 cycles 

Total Miss Latency to 

200 cycles (2 hops), 

Remote Memory 

300 cycles (3 hops) 


| Pipeline Parameters [ 

Issue Width 

4 

Functional Units 

2 Int, 2 FP, 

2 Mem, 1 Branch 

Reorder Buffer Size 

32 

Integer Multiply 

12 cycles 

Integer Divide 

76 cycles 

Ail Other Integer 

1 cycle 

FP Divide 

15 cycles 

FP Square Root 

20 cycles 

All Other FP 

2 cycles 

Branch Prediction 

2-bit Counters 


The second scenario which can result in deadlock is if thread A spin-waits for a resource that is held by 
thread B , where B is currently suspended on the same processor as A, and A never yields the processor 
to B in the course of spin-waiting. This scenario can arise with any form of synchronization that involves 
spin-waiting (e.g., locks and barriers). Our solution is to force a thread switch (in software) as part of all 
spin-waiting loops. Not only does this approach avoid deadlock, it also has the added benefit that it helps 
the processor tolerate synchronization latency. 

3 Experimental Framework 

To evaluate our software-controlled multithreading scheme, we performed detailed cycle-by-cycle simulations 
of a collection of seven applications from the SPLASH-2 benchmark suite [20] on a shared-memory multi¬ 
processor with out-of-order superscalar processors similar to the MIPS R10000 [21]. Our simulation model 
varies slightly from the actual MIPS R10000—e.g., we model two memory units, and we assume that all 
functional units are fully-pipelined. However, we do model the rich details of the processor, including the 
pipeline, register renaming, the reorder buffer, branch prediction, instruction fetching, branching penalties, 
the memory hierarchy (including contention), etc. The parameters of our model are shown in Table 1. 

Our multiprocessor system model is roughly based on the SGI Origin [9]. We use a full-map directory to 
implement invalidation-based cache coherence. Remote accesses require either two or three network hops, 
depending on whether the data can be supplied by the home node or whether it must be forwarded from a 
dirty-remote node. We do not model network contention, but we do model memory contention in detail. As 
shown in Table 1, the two and three hop remote accesses result in nominal latencies of 200 and 300 cycles, 
respectively, not including additional delays due to memory contention. 

We would like to emphasize that we simulate the actual thread-switching instructions shown in Fig¬ 
ure 1, rather than simply modeling thread-switching as some fixed latency. In addition, we precisely 
model the timing of the trap mechanism for informing memory operations in the R10000, as described 
by Horowitz et al. [5, 6]. Our thread-switching code consists of a total of 104 instructions—of these, 94 are 
memory references. Given that our processor has two memory units, the memory references alone would 
dictate a minimum thread switching time of at least 47 cycles. Since we also model the instruction and 
data cache misses caused by the miss handler code, data dependences, resource constraints, etc., we ob¬ 
serve a thread switching latency that is closer to 55 cycles. (The actual thread switching time varies across 
applications, and in one case is over 100 cycles, as we will see later in Section 4.) 

We performed our experiments on the following applications from SPLASH-2: Cholesky, Fft, Lu-cont, 
Lu-ncont, Ocean-cont, Ocean-ncont, and Radix. Table 2 briefly summarizes each application, along 
with the input data sets and other statistics. Further details on these applications can be found in the study 

hardware supports sampling counters with the informing memory traps, or by scheduling explicit instructions in the code to 
turn the handler back on after keeping it disabled before restarting), or to combine an explicit test for a partial-latency miss 
with the test for an secondary cache miss inside the miss handler before invoke the thread switch code. In our experiments, we 
model the sampling counter approach. 
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Table 2: Benchmark characteristics table describes the benchmarks, input data set, and cache miss counts. 
The total number of misses, misses that hit in local memory, and remote miss counts are given for the 
2-processor case. 


Name 

Description 

Input Data Set 

Instructions 

Graduated 

Cache Miss Count 

Total 

Local Mem. 

Remote 

Cholesky 

Sparse Cholesky factorization 

tkl4.0 input file 

44.6M 

289K 

28.8K 

22.5K 

Fft 

ID fast Fourier transform 

65536 complex points 

30.1M 

256K 

123.8K 

59.6K 

Lu-cont 

LU factorization with 
contiguous partitions 

512x512 matrix, 

32x32 elem. blocks 

184M 

755K 

45.8K 

50.7K 

Lu-ncont 

LU factorization with 
non-contiguous partitions 

512x512 matrix 

32x32 elem. blocks 

205M 

7508K 

66.7K 

62.7K 

Ocean-cont 

Large-scale ocean simulation 
with contiguous partitions 

130x130 grid 

48.9M 

2009K 

52.2K 

1.5K 

OCEAN-NCONT 

Large-scale ocean simulation 
with non-contiguous partitions 

130x130 grid 

65.6M 

2374K 

284.4K 

15.9K 

Radix 

Integer radix sort 

262144 keys, radix=1024, 
max key value=1024 

25.9M 

197K 

23.2K 

24.4K 
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Figure 2: Performance of the baseline software-controlled multithreading scheme. 


by Woo et al. [20]. All applications were compiled using version 2.8.0 of the gcc compiler, with -03 opti¬ 
mization. We used the MINT3 MIPS instruction interpreter tool (provided by MIPS) to drive our detailed 
performance model, thus allowing us to simulate all instructions (including the thread-switching code) in a 
detailed, execution-driven fashion. 


4 Experimental Results 

We now present results from our simulation studies. We begin by evaluating the performance of our baseline 
software-controlled multithreading scheme. To further improve upon this scheme, we evaluate the perfor¬ 
mance potential of two techniques for reducing overheads: (i) register partitioning to reduce the thread 
switching overheads, and (ii) miss prediction to avoid invoking the miss handler upon secondary cache hits. 

4.1 Performance of the Baseline Software-Controlled Multithreading Scheme 

The results of our first set of experiments can be found in Figure 2 along with Tables 3 and 4. Figure 2 
shows the performance impact of multithreading with two and four threads per processor on a two-processor 
machine, and with two threads per processor on a four-processor machine. 6 Each bar is labeled with the 
number of threads per processor, with the number of processors below that. 

Since MINTS can only simulate up to eight parallel threads at this point, we were not ahle to explore larger machine 
configurations. NOTE TO REVIEWERS: We hope to correct this by the final draft of the paper. By focusing on smaller 
machine configurations, we tend to underestimate the fraction of secondary cache misses that would be remote in a larger 
machine—hence our results are conservative since the potential performance gains are likely to be larger in larger-scale machines. 
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Table 3: Breakdown of the performance of the baseline software-controlled multithreading scheme. Perfor¬ 
mance is normalized to the 2-processor case with no multithreading. Memory stall time is broken down into 
misses found in the L2 cache, local memory, or remote memory, and misses combined with other misses. 


Benchmark 

# of 
Procs 

# of 
Threads 
per 
Proc 

Total 

Exec. 

Time 

j Breakdown of Normalized Graduation Slots || 

Busy 

Inst. 

Stall 

Sync 

Stall 

Stalls Due to LI Misses 
Found in Location Below 

Thread 

Switching 

L2 

Mem 

Rem. 

Comb. 

Setup 

Hegs | 

Cholesky 

H 

1 


■Mil 

mxmm 


mnm 

3.3 

5.0 

22.7 

0.0 

■ran 

2 

■EEES 

11.4 

moncm 



1.1 

0.0 

19.1 

2.1 

■ai 

4 

109.2 

11.0 


45.4 

m 

“1.0 


19.1 

2.1 

-ju. 


1 

84.1 

6.4 




2.5 

4.8 

12.5 

0.0 


2 ; 

87.8 


BMW 

■1.1W 

0.1 

0.5 

0.0 

14.8 

1.9 

5.7 

Fft 

2 

1 : 

IHEEHsll 

\wsEsa 

mmw.m 

■MW 

1.1 

15.5 

iHfOEM 

15.1 


0.0 

2 

89.9 

11.4 

19.0 


HQ 


0.0 

17.5 


18.4 

4 

85.6 

10.1 

KW 

■IM 


IHQ 

0.6 

15.4 

5.2 

18.1 

4 

1 

65.1 

5.6 

977 


KgJ| 

776 

15.0 

6.1 

0.0 

■ran 

2 

IKEBI 

\wmm 

10.0 

19.8 

0.0 

0.0 

0.0 

8.7 

4.2 

mnaa i 

Lu-cont 

2 

1 

100.0 

17.8 

32.0 

14.5 

2.0 

1.2 

3.2 

29.2 

0.0 

0.0 

2 

90.9 

17.5 

29.4 

13.2 

0.1 

0.2 

0.0 

27.2 

1.5 

1.8 

4 

96.8 

16.8 

30.8 

16.9 

0.1 

0.3 

0.0 

27.4 

2.5 

2.0 

4 

1 

61.3 

8.9 

14.9 

16.1 

0.9 

0:8 

5.7 

14.1 

0.0 

0.0 

2 

57.4 

7.9 

15.0 

17.7 

0.1 

0.0 

0.0 

13.8 

1.1 

1.8 

Lu- 

NCONT 

2 

1 

100.0 

19.6 

15.4 

7.6 

23.3 

1.4 

3.4 

29.4 

0.0 

0.0 

2 

87.5 

21.0 

14.0 

7.1 

0.3 

0.2 

0.0 

32.3 

10.4 

2.2 

4 

86.8 

19.8 

13.3 

9.1 

0.3 

0.3 

0.0 

31.9 

9.9 

2.2 

4 

1 

64.1 

11.4 

7.4 

12.6 

8.0 

0.1 

5.8 

187? 

0.0 

0.0 

2 

WHT 

10.1 

7.9 

11.0 

0.2 

0.0 

0.0 

21.3 

4.8 

2.4 

Ocean- 

cont 

2 

1 

100.0 

19.6 

43.7 

4.4 

12.7 

3.8 

0.4 

15.5 

0.0 

0.0 

2 

89.3 

20.4 

39.2 

4.6 

0.0 

0.0 

0.0 

17.5 

4.9 

2.7 

4 

112.9 

20.6 

43.5 

11.0 

0.0 

0.0 

0.0 

28.9 

5.5 

3.4 

4 

1 

50.6 

10.4 

21.1 

3.1 

6.1 

0.7 

0.9 

8.3 

0.0 

0.0 

2 

48.1 

9.5 

19.8 

3.3 

0.0 

0.0 

0.0 

10.5 

2.7 

2.3 

OCEAN- 

NCONT 

2 

1 

100.0 

15.6 

32.1 

6.5 

11.2 

13.9 

1.2 

19.5 

0.0 

0.0 

2 

99.1 

15.6 

31.9" 

6.1 

0.0 

0.0 

0.0 

28.0 

6.1 

11.4 

4 

121.2 

16.0 

35.4 

16.4 

0.0 

0.0 

0.0 

33.8 

6.5 

13.1 

4 

1 

50.9 

8.2 

16.3 

3.0 

4.9 

7.0 

1.0 

9.6 

0.0 

0.0 

2 

48.8 

7.1 

16.2 

3.1 

0.0 

0.0 

0.0 

13.9 

2.7 

5.8 

Radix 

2 

1 

100.0 

18.8 

53.5 

2.5 

2.1 

3.8 

8.1 

11.2 

0.0 

0.0 

2 

99.1 

18.1 

52.9 

1.3 

0.1 

0.2 

0.0 

13.8 

0.9 

11.3 

4 

106.1 

19.0 

54.3 

4.8 

o.i 

0.2 

0.0 

14.4 

1.5 

11.5 

4 

1 

53.6 

9.7 

26.5 

1.6 

1.0 

1.4 

7.6 

5.7 

0.0 

0.0 

2 

53.8 

8.4 

26.0 

0.9 

0.1 

0.1 

0.6 

7.9 

0.6 

9.8 


The execution times are normalized to the case without multithreading on two processors, and they are 
broken down into nine categories explaining what happened during all potential graduation slots. 7 The 
bottom section (Busy) ^is the number of slots when instructions actually graduate. The Mem Stall and 
Sync Stall sections are any non-graduating slots that can be directly attributed to data cache misses or 
synchronization, respectively. Table 3 breaks down the Mem Stall slots further into four categories: the first 
three are when a primary cache miss is ultimately found in the secondary cache, local memory, or requires a 
remote access, respectively; the fourth case (labeled Comb.) is when a primary cache miss is combined with 
another outstanding miss in progress. Returning to Figure 2, the top two sections in the multithreading cases 
represent slots due to the thread switching code; these are broken down into time spent saving and restoring 
registers (TS Regs) and the remaining miss handler time (TS Setup ). Finally, the Inst Stall section is all other 
slots where instructions do not graduate. Note that these categories are only a first-order approximation of 
what is limiting performance, due to the inherent parallelism within an out-of-order superscalar processor 
and the fact that delaying one dependence tends to exacerbate subsequent dependences. 

As we see in Figure 2, software-controlled multithreading results in significant speedups ranging from 10% 
to 16% in four of the seven applications (Fft, Lu-CONT, Lu-NCONT, and Ocean-cont), and more modest 
speedups of 1-2% in the other three cases. We also see that adding more threads does not necessarily improve 
performance. For example, Ocean-cont (on two processors) goes from a 12% speedup with two threads 
per processor to a comparable slowdown with four threads per processor. For all applications, however, there 

7 The number of graduation slots is the issue width (4 in this case) multiplied by the number of cycles. We focus on graduation 
rather than issue slots to avoid counting speculative operations that are squashed. 
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Table 4: Additional statistics on the baseline multithreading scheme. 


Benchmark 

Avg. L2 
Cache Miss 
Latency 
(cycles) 

Average 

Run 

Length 

(cycles) 

Average 
Thread 
Switch Time 
(cycles) 

Cholesky 

127 

826 

71 

Fft 

114 

161 

56 

Lu-cont 

143 

1769 

57 

Lu-ncont 

139 

1366 

53 

Ocean-cont 

84 

1125 

55 

Ocean-ncont 

83 

260 

54 

Radix 

137 

627 

108 


is at least one configuration where software-controlled multithreading improves performance. 

Let us begin by focusing on the impact of multithreading on memory stall times. We observe that 
without multithreading, six of the seven applications (all except Radix) are spending over a third of their 
time stalled waiting for data when running on two processors; in three of these cases (Fft, Lu-ncont, and 
Ocean-ncont), about one-half of execution time is lost to memory stalls. By exploiting 2-way multithread¬ 
ing on two processors, we are able to hide 23% to 63% of the memory stall time; in six of the seven cases, 
multithreading hides over 35% of these stalls. As we see in Table 3, the bulk of the remaining miss latency 
with multithreading is due to misses that combine with other outstanding misses. For these combined misses, 
we are able to partially (but not fully) hide the memory latency. This effect is accentuated in part because 
our simple round-robin scheduling scheme blindly restarts the next thread without taking into consideration 
whether its miss has completed, or whether there are other threads that are ready to run. We chose our 
simple thread scheduling scheme, however, to minimize thread switching overhead and to avoid deadlock. 

The benefit of reduced memory stall times is at least partially offset by the thread switching overheads. 
In four of the seven applications (Cholesky, Lu-cont, Lu-ncont, and Ocean-cont), the switching 
overhead with two threads each on two processors is less than 30% of the original memory stall time; in 
the other three cases, however, this overhead is almost one-half of the original memory stall time. It is 
not surprising that the thread switching times are non-trivially large, given that all of the thread switching 
is performed by software. The good news, however, is that the thread switching times are actually small 
enough that we do see some non-trivial performance gains. For example, even though Fft experiences a 
large thread-switching overhead, it still enjoys a 16% speedup with software-controlled multithreading. As 
we see Figure 2 and Table 3, the bulk of the thread switching overhead is usually due to saving and restoring 
registers, as opposed to other time spent in the miss handler. (The major exception to this is Lu-NCONT, 
where most of the time is spent entering the miss handler and then deciding not to switch threads due to 
the reference hitting in the secondary cache.) Later in this section, we will evaluate techniques for reducing 
this thread-switching overhead. 

We observe that multithreading generally had no positive impact on synchronization stalls. Part of the 
reason for this is that the bulk of the synchronization stalls in these applications are due to barriers. Since 
barrier stall times are dominated by load imbalance, which is not directly improved by latency tolerance, 
there is little opportunity for multithreading to improve their performance. In fact, the synchronization stall 
times become noticeably worse with four threads in several applications due to load imbalance problems. 

To provide further insight into the multithreading behavior, Table 4 shows the following statistics: (i) 
the average secondary cache miss latency, which is the latency that a thread switch attempts to hide; (ii) 
the average run length , which is how long a thread executes between thread switches; and (iii) the average 
thread switching latency. (These numbers were collected from the case with two threads per processor on 
two processors, but the same trends hold in the other multithreading configurations.) First, we observe that 
the average secondary cache miss latency is significantly larger than the average thread switching latency in 
all cases. If this were not true, then the overhead of multithreading would offset any potential gains. Aside 
from the two versions of Ocean (which are dominated by capacity misses, and where there is sufficient 
locality in the data distribution such that most secondary cache misses hit in local memory), the average 
miss latencies in the other applications are over 110 cycles due to the fact that a reasonably large fraction 
of secondary cache misses require remote communication. While five of the seven applications have thread 
switching latencies ranging from 53 to 57 cycles, Cholesky and Radix experience much larger switching 
latencies: 71 and 108 cycles, respectively. These larger switching latencies are primarily caused by the 
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Table 5: Impact of register partitioning on thread switching latencies. 


Benchmark 

{ Avg. Thread Switching Latency (cycles) || 

Baseline Case 

Register Partitioning 

Cholesky 

71 

26 

Fft 

56 

18 

Lu-cont 

57 

20 

Lu-ncont 

53 

21 

Ocean-cont 

55 

20 

OCEAN-NCONT 

54 

19 

Radix 

108 

37 


application displacing the thread switching instructions and data from the caches between thread switches. 

Roughly speaking, we would expect the performance to saturate when the number of additional threads 
beyond the main thread is equal to where T, R , and C are the average miss latency, run length, and 

thread switching latency, respectively. Given the data in Table 4, we would expect to reach this saturation 
point with only one additional thread per processor, which is generally true. The one noticeable exception— 
Fft, which benefits from having four threads each on two processors—is also the case with the smallest 
average run length. 

Finally, we observe that when multiple threads share the same physical cache, they can potentially 
interfere with each other either constructively (by effectively prefetching another thread’s working set) or 
destructively (by displacing another thread’s working set). While we did not observe any cases where destruc¬ 
tive interference was problematic, we did observe a case of positive interference. In Lu-ncont, consecutive 
threads often access the same cache lines. When these threads are on separate processors, this sharing 
pattern results in communication and remote accesses. When consecutive threads are assigned to the same 
processor, however (as occurs under multithreading), one thread effectively prefetches the data set of another 
thread. 

In summary, we have seen that our baseline software-controlled multithreading scheme can yield non¬ 
trivial performance gains. However, a key bottleneck which is limiting further performance improvement is 
the time spent switching between threads in software. To address this problem, we now consider techniques 
for reducing this overhead. 

4.2 Register Partitioning 

As we discussed earlier in Section 2.2.1, one approach to reducing the thread switching overhead is to 
partition the register set between threads, thereby reducing the number of registers that must be saved and 
restored. To perform these experiments, we recompiled each application using the -ffixed flag in gcc to 
control how many user registers could be allocated to a given thread. The following special-purpose MIPS 
registers could not be partitioned, and must still be saved and restored upon a thread switch: at, vO-vl, 
a0-a3, gp, sp, fp, ra and fcr31. By partitioning the remaining registers between threads, we were able to 
reduce the thread switching code to only 34 instructions, 24 of which were memory references. This reduced 
the average thread switching latency to as little as 18 cycles, as shown in Table 5. As we see in Table 5, 
register partitioning reduces the thread switching latency by at least a factor of 2.5 in all cases. 

Figure 3 shows the impact of register partitioning on performance. For each multithreading case, we 
show two bars: the bar labeled B is the base case (shown earlier in Figure 2), and the bar labeled R is the 
case with register partitioning. As we see in Figure 3, the results are mixed. 

In the cases with four threads per processor, register partitioning improves the performance of only one 
application: Fft, which enjoys a 7% speedup. For the other six applications, the negative impact of increased 
register spilling more than offsets the positive impact of faster thread switching. The problem in this case 
is that partitioning the registers between four threads eliminates three fourths of the user registers available 
to a given thread. As threads run for longer periods of time between thread switches, it becomes more 
important to have good register allocation rather than fast thread switching. Hence it is not surprising that 
the one application which actually benefits from four-way partitioning (Fft) also had the shortest average 
run length (as shown earlier in Table 4). 

Register partitioning is more successful when there are only two threads per processor, in part because 
each thread loses only half of its user registers. As we see in Figure 3, two applications (Fft and Radix) enjoy 
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Figure 3: Impact of register partitioning on performance (B = baseline multithreading, R = multithreading 
with register partitioning). Execution times are normalized to the case without multithreading on two 
processors. 

significant performance gains from register partitioning with two threads per processor, and one application 
(Lu-CONT) enjoys a modest speedup. As we saw earlier in Figure 2, Fft, OCEAN-NCONT, and Radix each 
spend over 10% of their time saving and restoring registers to perform thread switches in the baseline case. 
Hence it is not surprising that we see large performance gains due to register partitioning in Fft and Radix. 
In contrast, Ocean-ncont has higher register pressure than either Fft or Radix, and consequently it loses 
too much performance due to register spilling to make up for the faster thread switching time. 

Overall, we see that register partitioning can potentially improve performance by reducing the number 
of registers that must be saved and restored upon a thread switch. For example, in the case of Radix, 
software-controlled multithreading offers almost no speedup on two processors in the baseline case, but it 
enjoys a 7% speedup with register partitioning. However, register partitioning is a technique that must be 
used with caution, since it can hurt performance if it causes too much register spilling. For example, with 
four threads per processor, the penalty of increased spilling due to having only 25% of the original user 
registers almost always outweighs the benefits of reduced switching overhead. Since the decision of whether 
to perform partitioning is controlled by software, the programmer has the flexibility to choose the option 
that works best for a given application. An even better solution would be for the compiler to make this 
decision automatically , which may be feasible since the compiler is aware of register spilling when it performs 
register allocation, and could adjust the degree of partitioning accordingly. 

4.3 Miss Prediction 

The final optimization that we consider is using prediction techniques to avoid invoking the miss handler 
upon primary cache misses which hit in the secondary cache (as discussed earlier in Section 2.2.2). The basic 
idea is to predict the conditional probability of a secondary cache miss given a primary cache miss for a 
specific reference, and to use this information at the time when a primary miss is detected to decide whether 
or not to actually invoke the miss handler. In theory, this could allow us to reduce some of the TS Setup 
time shown earlier in Figure 2. However, based on the results of our experiments, this optimization does not 
appear to be useful in practice. Even with a perfect prediction mechanism, the potential performance gain is 
generally quite small (just a few percent). 8 When we experimented with dynamic hardware predictors (e.g., 
two-bit saturating counters and other mechanisms commonly used for branch prediction), we were unable 
to achieve any speedup over the baseline case. Stride predictors are not helpful, since both the primary and 
secondary caches share the same line size. While it is easy to predict that a large fraction of references will 
hit in the secondary cache (especially those that enjoy spatial locality), most of these references also hit in 
the primary cache, in which case the miss handler would not be invoked anyway. 

The fundamental problem is that accurately predicting the conditional probability of a secondary cache 
miss given a primary cache miss is difficult, and the penalty of a false negative (i.e. failing to predict a 

8 Note that only a fraction of the TS Setup time can be eliminated, since much of it is due to real thread switches. 
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secondary cache miss) is extremely large, since we will fail to hide any of the miss latency in that case. (In 
contrast, the penalty of a false positive is much smaller, since we will quickly discover the mistake after 
entering the miss handler.) Hence all of the realistic predictors that we considered actually hurt performance 
by generating too many false negatives. The lesson that we have learned from these experiments is that it is 
far more important to reduce the overhead associated with actually switching threads (the largest component 
of which is saving and restoring registers) than trying to avoid invoking the miss handler in cases where a 
thread switch is unnecessary. 

5 Conclusions 

In contrast with previous studies on using multithreading to tolerate memory latencies in tightly-coupled 
machines, we have considered a completely new approach: one that is software-controlled, rather than 
hardware-controlled. The advantage of our approach is that due to its much simpler hardware support, 
it does not run the risk of degrading single-thread performance in applications which cannot benefit from 
multithreading (e.g., those that do not contain parallel threads). For example, our scheme does not require 
any modifications to the register file, unlike hardware-controlled schemes which typically require a much 
larger register file (thereby increasing register access latencies). The primary hardware support required by 
our scheme is informing memory operations , which have klready been shown to be useful for a wide variety 
of purposes other than multithreading, and which are not expected to degrade single-thread performance. 

Our experimental results demonstrate that software-controlled multithreading can result in significant 
performance gains. In our baseline scheme, four of seven applications speed up by 10% or more, with one 
application speeding up by 16% (Fft). By judiciously applying register partitioning to reduce the thread 
switching overhead in cases where it does not result in excessive register spilling, we can enjoy even larger 
speedups: e.g., an overall speedup of 25% in the case of Fft. Since both remote latencies and the amount of 
remote communication are expected to increase with larger numbers of processors, we expect even greater 
performance gains on larger scale multiprocessors. 

As we look to the future, software-controlled multithreading should become even more attractive as 
instruction overhead becomes less and less expensive relative to memory latency. Software-controlled mul¬ 
tithreading is a gentle path to providing the performance benefits of multithreading when it matters the 
most, without biting off the full cost and overheads associated with hardware-controlled multithreading. 
The attractiveness of software-controlled multithreading provides another compelling reason for future mi¬ 
croprocessors to support informing memory operations. 


References 

[1] A. Agarwal, B.-H. Lim, D. Kranz, and J. Kubiatowicz. APRIL: A Processor Architecture for Multipro¬ 
cessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, pages 
104-114, May 1990. 

[2] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera Computer 
System. In Proceedings of the International Conference on Supercomputing, pages 1-6, June 1990. 

[3] S. Dwarkadas, A. L. Cox, and W. Zwaenepoel. An Integrated Compile-Time/Run-Time Software Dis¬ 
tributed Shared Memory System. In Proceedings of the 7th International Conference on Architectural 
Support for Programming Languages and Operating Systems , October 1996. 

[4] K. Farkas and N. Jouppi. Complexity/performance tradeoffs with non-blocking loads. In Proceedings 
of the 21st International Symposium on Computer Architecture , pages 211-222, April 1994. 

[5] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Informing Memory Operations: Providing 
Performance Feedback in Modern Processors. In Proceedings of the 23rd International Symposium on 
Computer Architecture , pages 260-270, May 1996. 

[6] M. Horowitz, M. Martonosi, T. C. Mowry, and M. D. Smith. Informing Memory Operations: Memory 
Performance Feedback Mechanisms and Their Applications. ACM Transactions on Computer Systems , 
16(2): 170—205, May 1998. 


13 



[7] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the 8th Inter¬ 
national Symposium on Computer Architecture, pages 81-85, 1981. 

[8] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multipro¬ 
cessors and Workstations. In Proceedings of the Sixth International Conference on Architectural Support 
for Programming Languages and Operating Systems, pages 308-318, October 1994. 

[9] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 
24 th International Symposium on Computer Architecture, pages 241-251, June 1997. 

[10] C. K. Luk and T. C. Mowry. Compiler-Based Prefetching for Recursive Data Structures. In Proceedings 
of the 7th International Conference on Architectural Support for Programming Languages and Operating 
Systems, pages 222-233, October 1996. 

[11] T. C. Mowry. Tolerating Latency Through Software-Controlled Data Prefetching. PhD thesis, Stanford 
University, March 1994. Technical Report CSL-TR-94-626. 

[12] T. C. Mowry, C. Q. C. Chan, and A. K. W. Lo. Comparative Evaluation of Latency Tolerance Tech¬ 
niques for Software Distributed Shared Memory. In Proceedings of the 4th IEEE Symposium on High- 
Performance Computer Architecture, pages 300-311, February 1998. 

[13] A. K. Porterfield. Software Methods for Improvement of Cache Performance on Supercomputer Appli¬ 
cations. PhD thesis, Department of Computer Science, Rice University, May 1989. 

[14] I. Schoinas, B. Falsafi, A. R. Lebeck, S. K. Reinhardt, J. R. Larus, and D. A. Wood. Fine-Grain 
Access Control for Distributed Shared Memory. In Proceedings of the 6th International Conference 
on Architectural Support for Programming Languages and Operating Systems, pages 297-306, October 
1994. 

[15] B. J. Smith. Architecture and applications of the HEP multiprocessor computer system. SPIE 298-241- 
248, 1981. 

[16] K. Thitikamol and P. Keleher. Multi-threading and Remote Latency in Software DSMs. In Proceedings 
of the 17th International Conference on Distributed Computing Systems, 1997. 

[17] D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting Choice: 
Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings 
of the 23rd International Symposium on Computer Architecture, pages 191-202, May 1996. 

[18] D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous Multithreading: Maximizing On-Chip 
Parallelism. In Proceedings of ISC A 22, pages 392-403, June 1995. 

[19] M. E. Wolf and M. S. Lam. A data locality optimizing algorithm. In Proceedings of the SIGPLAN ’91 
Conference on Programming Language Design and Implementation, pages 30-44, June 1991. 

[20] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2 programs: Characterization 
and methodological considerations. In Proceedings of the 22nd International Symposium on Computer 
Architecture, pages 24-38, June 1995. 

[21] K. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, April 1996. 


14 



School of Computer Science 
Carnegie Mellon University 
Pittsburgh, PA 15213-3890 


Carnegie Mellon University does not discriminate and Carnegie Mellon University is required 
not to discriminate in admission, employment, or administration of its programs or activities 
on the basis of race, color, national origin, sex or handicap in violation of Title VI of the Civil 
Rights Act of 1964, Title IX of the Educational Amendments of 1972 and Section 504 of the 
Rehabilitation Act of 1973 or other federal, state, or local laws or executive orders. 

In addition, Carnegie Mellon University does not discriminate in admission, employment or 
administration of its programs on the basis of religion, creed, ancestry, belief, age, veteran 
status, sexual orientation or in violation of federal, state, or local laws or executive orders. 
However, in the judgment of the Carnegie Mellon Human Relations Commission, the Depart¬ 
ment of Defense policy of, “Don’t ask, don’t tell, don’t pursue,” excludes openly gay, lesbian 
and bisexual students from receiving ROTC scholarships or serving in the military. Neverthe¬ 
less, all ROTC classes at Carnegie Mellon University are available to all students. 

Inquiries concerning application of these statements should be directed to the Provost, 
Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, telephone (412) 268- 
6684 or the Vice President for Enrollment, Carnegie Mellon University, 5000 Forbes Avenue, 
Pittsburgh, PA 15213, telephone (412) 268-2056. 

Obtain general information about Carnegie Mellon University by calling (412) 268-2000. 





